Currently, 6.5 million adult Americans are affected by heart failure. Heart failure has a 5 year mortality rate, despite later-stage heart failure with reduced ejection fraction (HFrEF) showing diverse etiologies and genetic contributions they are still considered to evolve via a final common pathway. Current therapies are relatively indifferent to disease etiology and treat HFrEF due to ischemic cardiomyopathy (ICM) the same as dilated cardiomyopathy (DCM), even though ICM has a much worse prognosis than DCM. This can be an indication of an incomplete understanding of the different biological mechanisms contributing to HFrEF (Sweet et al. 2018).
Throughout the course of the term the data set being studied involved 64 human left ventricular samples, 37 of which are from DCM, 13 from ICM, and 14 from non-failing hearts (NF) were studied. The study in which the data originated from aimed to uncover common and etiology-specific gene signatures between the three cohorts (Sweet et al. 2018).
This data set was downloaded from the Gene Expression Omnibus (GEO) and has a GEO accession of GSE116250.
In Assignment 1, the data set has been normalized using RPKM and genes with mean RPKM across all samples less than five have been filtered out to remove some noise. The decision to use RPMK as the normalization method was that of the authors since the data set posted to GEO was in RPKM. Prior to removing lowly expressed genes with mean RPKM less than 5 there were 57974 genes. After removing these lowly expressed genes 12381 genes remained.
It’s important note the normalization and data processing methods were not clearly described by the authors in their original paper. In Assignment 1, I performed TMM normalization to see if the authors performed any normalization between samples. The hypothesis was if the data has been normalized between samples as well then the plots before TMM normalization and after TMM normalization should be the same. In. the density plot post TMM normalization there was a large spike at 0 essentially making all the data at one point. This density plot does not make sense. Furthermore, this is a drastic change to the original data and is undesired.
Due to the results seen from the plots after TMM normalization and lack of information in the paper about normalization methods used other than RPKM, it was concluded the data posted is already normalized. Further normalization of this data results in dramatic changes to the data that isn’t desirable.
In Assignment 2, the normalized expression data was ranked based on their differential expression and a threshold over-representation analysis (ORA) was performed highlighting the dominant themes in the gene list. It was found the dominant themes were mitochondria, metabolic processes, and gene expression such as in transcription and translation.
This is consistent with the findings of (Sweet et al. 2018). Additionally, a study by (Rosenbaum, Agre, and Pereira 2020) detailed ten genes that have most commonly been implicated in DCM are mostly involved in gene expression such as in the transcription and translation process. A study looking at candidate genes in ICM had similar results where they found that in ICM enriched pathways were found in mitochondria and gene expression such as in post-translational modification (Dang et al. 2020).
In this final assignment a non-thresholded gene set enrichment analysis (GSEA) will be conducted followed by a visualization of the results in Cytoscape. with the use of the EnrichmentMap pipeline. Finally, a post analysis will be conducted examining potential drugs being used in the treatment of these diseases.
# Install required packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("tibble", quietly = TRUE)) {
install.packages("tibble")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (!requireNamespace("tidyr", quietly = TRUE)) {
install.packages("tidyr")
}
if (!requireNamespace("readr", quietly = TRUE)) {
install.packages("readr")
}
if (!requireNamespace("fgsea", quietly = TRUE)) {
BiocManager::install("fgsea")
}
suppressPackageStartupMessages({
library("tibble")
library("dplyr")
library("fgsea")
})
First, let’s get the list of genes that were calculated to be differentially expressed in Assignment 2 and get their ranks to make a ranked gene list.
# load in the ranked gene list
rankFile <- "ranked_genes.rds"
if (file.exists(rankFile)) {
rankedGenes <- readRDS(rankFile)
# filter to only the gene name and rank
rankedGenes <- rankedGenes[, c("hgnc_symbol", "rank")]
} else {
# if the ranked gene list doesn't exist yet, load in the output hits data from assignment 2
outputHitsFileName <- "output_hits.rds"
outputHits <- readRDS(outputHitsFileName)
outputHits <- tibble::as_tibble(outputHits)
# calculate the rank
outputHits$rank <- -log(outputHits$P.Value, base = 10) * sign(outputHits$logFC)
# save for future use
saveRDS(outputHits, rankFile)
# only keep gene symbol and rank column
rankedGenes <- dplyr::select(outputHits, hgnc_symbol, rank)
}
Next, let’s conduct a no-thresholded gene set enrichment analysis.
# load in the gene set to use from Bader labs
pathways <- fgsea::gmtPathways("Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt")
# conduct gsea
gseaResults <- fgsea::fgsea(pathways = pathways,
stats = tibble::deframe(rankedGenes),
maxSize = 200,
minSize = 15,
nperm = 1000)
gseaResults <- tibble::as_tibble(gseaResults)
Let’s examine the top few gene sets overrepresented at the top of the ranked list.
# order results in descending value of NES (enrichment score normalized to mean enrichment of random samples of the same size)
gseaResultsTop <- gseaResults %>% dplyr::arrange(desc(NES))
# view the results associated with the top of the list with largest NES
head(gseaResultsTop)
Table 1: Results from GSEA of top 6 gene sets overrepresented at the top of the ranked list.
Next, let’s look at the results of the top 6 gene sets overrepresented at the bottom of the ranked list.
# view the results associated with the bottom of the list with lowest NES
gseaResultsBottom <- gseaResults %>% dplyr::arrange(NES)
head(gseaResultsBottom)
Table 2: Results from GSEA of top 6 gene sets overrepresented at the bottom of the ranked list.
1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.
I decided to use the function fgsea from the R package
fgsea for the gene set enrichment analysis. It uses a fast
algorithm for the GSEA allowing more permutations in a shorter amount of
computation time (Korotkevich et al.
2016). It allows for the GSEA to be computed on the fly in a
short amount of time and when compared to the GSEA Java app it gives
similar results.
The gene sets I used were from the Bader Lab dated March 2, 2023. They include all human pathways from GO Biological Process and they do not have inferred electronic annotations (BaderLab 2023). No electronic annotations were chosen to reduce noise and ensure higher accuracy of the results.
2. Summarize your enrichment results.
From the GSEA the gene sets associated with the genes overrepresented at the top of the ranked list are more correlated with mitochondrial gene expression such as translation initiation and translation termination, seen in Table 1. In Table 2, it shows that the genes sets associated with the genes overrepresented at the bottom of the ranked list are more correlated with metabolic processes like glycosaminoglycan metabolism and proteoglycan metabolism.
This indicates that those with DCM and ICM experience more mitochondrial dysfunction. This is consistent with what the authors have found in their paper (Sweet et al. 2018). It also shows that there are common pathways of mitochondrial dysfunction in DCM and ICM. For the gene sets enriched at the bottom of the ranked list it indicates there is enrichment in metabolic processes with the top one being glycosaminoglycan metabolism. Glycosaminoglycans are key players in organizing the extra cellular matrix, cell-matrix interactions and signal regulations (Ricard-Blum and Perez 2022). Literature shows that the extracellular matrix is crucial in cardiac homeostasis as it provides structural support and facilitates key signals to cardiomyocytes, vascular cells, and interstitial cells. Changes in such can potentially cause heart failure (Frangogiannis 2019). Thus gene sets more correlated with bottom of the ranked list are related to the extracellular matrix and cell-cell interactions.
3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?
The results for the up regulated genes in Assignment 2 were:
upFilePath <- "thresholded_gsea_up.rds"
a2UpResults <- readRDS(upFilePath)
head(a2UpResults)
Table 3: Enriched gene sets from thresholded GSEA (threshold. 0.05) using up regulated genes only
The results for the down regulated genes in Assignment 2 were:
downFilePath <- "thresholded_gsea_down.rds"
a2DownResults <- readRDS(downFilePath)
head(a2DownResults)
Table 4: Enriched gene sets from thresholded GSEA (threshold. 0.05) using down regulated genes only.
The threshold for both up and down was 0.05 and the FDR was used for multiple hypothesis testing correction.
Based on the results from Assignment 2 the main themes for the up regulated genes is relatively consistent. The top two results in Table 3, thresholded GSEA, are mitochondrial gene expression and in Table 1, non-thresholded GSEA, all the results except the second are for mitochondrial translation. However, there are some differences in that in the thresholded analysis the results seem more general while in the non-thresholded analysis they seem a bit more specific in that they are mainly localized in the mitochondria and relate to translation. For example, although most results relate to translation there are results for non-mitochondrial translation such as cytoplasmic translation.
For the down-regulated genes there is more variation in the results. In Table 2, the non-thresholded analysis mainly shows themes in metabolism such as glycosaminoglycan, proteoglycan, and amine metabolism. Whereas in Table 4, the thresholded analysis, shows dominant themes in cell organization and cell structure such as lysosome organization.
Although the results aren’t exactly the same I don’t believe they are necessarily in disagreement. They both align with similar more general themes. For example, the up regulated genes in both analysis can be said to be involved in translation while the down regulated genes can be involved in metabolism of molecules needed for cell structure and organization.
These results still align with the experiment conditions, namely the samples and disease being studied. I don’t think this is a straight forward comparison, because different gene sets were used in these two analyses. It’s possible that some of the pathways are different because they weren’t included in the other ananlysis’s gene set.
1. Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout.
There are 4304 nodes and 31962 edges in the resulting map. An FDR q-value of 1.0 was used and a p-value of 0.05. A more permissive q-value was chosen to allow for more results past correction. Initially when the default FDR q-value was used, 0.1, there were 7 nodes and 10 edges.
Figure 1: Initial enrichment map prior to manual layout
2. Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.
The AutoAnnotate app that came with the EnrichmentMap pipeline was used to annotate the network. The parameters used were the MCL Cluster algorithm and annotate with a maximum of 50 annotations to allow for a more readable network. The label column was GS_DESCR.
3. Make a publication ready figure - include this figure with proper legends in your notebook.
Below in Figure 2 is the publication ready figure. Subsequent figures, Figures 3 to 5 are closeups of specific parts of the network and legend respectively.
Figure 2: Enrichment map annotated with AutoAnnotate
Figure 3: Close up of the main clusters in annotated enrichment map
Figure 4: Close up of smaller clusters in annotated enrichment map
Figure 5: Closeup of the legend for the annotated enrichment map.
4. Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?
Below in Figure 6 is the collapsed theme network generated from using “Collapse All” in AutoAnnotate.
Figure 6: Collapsed theme network generated from using the “Collapse All” function in AutoAnnotate.
The major themes can be seen in Figure 7. They include pathways involving the human epidermal growth factor (EGF) and anti arrhythmic drugs. Considering this study looked at samples with heart disease (DCM, ICM, and NF) it is not surprising to find themes regarding arrhythmias of the heart. Having these heart diseases can cause irregular heart rhythms.
Regarding epidermal growth factors, this is slightly a more novel pathway. In previous analyses and the original paper it mentions the main pathways in DCM and ICM are involved in mitochondria dysfunction, cell-cell interactions and immune response (Sweet et al. 2018). However, this is still consistent with the samples/disease being studied. DCM and ICM are subtypes of cardiomyopathy which is a heart disease making it harder for the heart to pump blood to the rest of the body. A potential reason that can explain why it is more difficult for those with DCM and ICM to pump blood to the rest of the body could be related to dysregulation in epidermal growth factors. This dysregulation could affect the composition of the heart and as a result its function as a muscle.
Figure 7: Major themes in the analysis
1. Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods
Yes, the enrichment results support the conclusions and mechanisms discussed in the original paper. The authors found that in DCM and ICM there were pathways enriched for mitochondria and dysregulation in these pathways contribute to mitochondrial dysfunction (Sweet et al. 2018). The authors also found some pathways enriched in metabolism such as oxidative phosphorylation and some of the results here in the non-thresholded analysis particularly with the down regulated genes saw the same enrichment in metabolic pathways. The exact metabolic pathways are different but still show the theme of metabolism.
The results here differ from the thresholded results in Assignment 2 in that the more specific themes are different. In the up regulated genes in Assignment 2 there were themes of gene expression, mainly translation, in mitochondria and elsewhere. While here in the non-thresholded analysis it shows themes of gene expression, mainly translation as well, but only in the mitochondria. They also both include a pathway for cellular redox reactions with the non-thresholded analysis being more specific in specifying hallmark reactive oxygen species pathway.
In the down-regulated genes the results are more variable. In the thresholded analysis of Assignment 2 they were mainly related to cell structure and organization such as lysosome organization and GPI anchor biosynthetic process. On the other hand, in the non-thresholded analysis they were mainly about metabolic processes for specific molecules like glycosaminoglycan metabolism and proteoglycan metabolism.
2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?
Yes, there is evidence from other publications to support some of the results that are seen. In a previous studies identifying candidate genes in ICM they found that some of the top significantly enriched pathways were related to gene expression such as post-translational modification and were involved in the mitochondria. The authors of that study also used the data set studied here to validate their results for some of the differentially expressed genes and found the results in this study were consistent with their findings (Dang et al. 2020). Publications also indicate DCM and other cardiomyopathies are associated with mitochondrial disorders, supporting the results seen here (Rosenbaum, Agre, and Pereira 2020).
The results here for the non-thresholded analysis show the top enriched gene set for the negatively ranked genes is diseases associated with glycosaminoglycan. Glycosaminoglycan takes part in organizing the cellular matrix, cell-matrix interactions, and signal regulations (Ricard-Blum and Perez 2022). It has been shown that changes to the extracellular matrix can be implicated in the pathogenisis of heart failure (Frangogiannis 2019). The authors of the original paper saw enrichment in pathways for cell-cell and cell-matrix pathways also (Sweet et al. 2018).
This evidence supports the result found here that gene expression especially in mitochondria and metabolism are major themes in the cardiomyopathies studied here.
For the post analysis, I chose to add drugs to my network. The drugs added were verapamil, lidocaine, mexiletine, sacubitril, valsartan, digoxin, atenolol, and nebivolol. I chose these drugs because they are common drugs prescribed to those with ICM and DCM (Mantziari et al. 2012). These drugs are also common in treating heart failure in general (MayoClinic 2023). In particular, I was interested in seeing verapamil since “verapamil action antiarrhythmic” was a significant cluster in the network and it’s used to treat arrhythmia, a common symptom of ICM and DCM (DrugBank 2023).
Figure 8: Network with 8 drugs added.
Figure 9: Zoomed in legend for network with drugs.
Of the 8 drugs added there were 5 found to be significant using the one sided Mann-Whitney test and a threshold of 0.8. They were atenolol, digoxin, mexiletine, nebivolol, and verapamil. In this network the orange edges represent the interactions a drug has with the pathways in the network. The one at the very bottom is verapamil. It has the most interactions compared to all the other drugs.
The other drugs, including verapamil, have interactions with other parts of the network but the main interactions target pathways related to heart contraction. Figure 10 shows the location and overall interactions of the 4 drugs excluding verapamil in yellow.
Figure 11 below is zoomed in on the group of pathways these drugs
interact the most with.
Figure 11: Pathways associated with heart contraction
These interactions make sense since those with DCM, ICM, and heart failure have trouble with heart contractions resulting in issues with blood flow and these drugs attempt to make it easier for the heart to circulate blood through the body. For example, atenolol is used to lower blood pressure (MayoClinic 2023).
Overall, the results in these three assignments support the findings of the authors and other studies that mitochondrial dysfunction is a common pathway between DCM and ICM. There is also evidence for enrichment in metabolism of molecules important to the extracellular matrix, cell-cell interactions, and cell-matrix interactions. Disruption to these pathways can cause heart failure. The drugs prescribed to those with DCM and ICM are also seen to interact with pathways of expected function relating to heart contraction.
The authors of the original paper did a comparison between each pair of cohorts as well to seek more fine differences between each cohort. This includes DCM vs ICM, DCM vs NF, and ICM vs NF. It would be interesting to do the analysis between each pair of cohorts and compare the results to the authors, however that would be quite time consuming. It is still interesting to see the larger themes found in these analyses match those of the authors.
The link to the corresponding journal entry can be found here.